# Image Captioning
## Vit GPT2 Image Captioning

Author: mo-thecreator · Downloads: 17 · Likes: 0
Tags: Image-to-Text · Transformers

An image captioning model based on the ViT-GPT2 architecture (a ViT image encoder paired with a GPT-2 text decoder), capable of generating natural language descriptions for input images.

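ViT-GPT2 captioners are typically packaged as standard Transformers vision-encoder-decoder checkpoints, so the high-level `image-to-text` pipeline is enough for a quick test. A minimal sketch; the repo id below is inferred from the entry's author and title and may differ from the actual checkpoint path:

```python
from transformers import pipeline

# Assumed repo id for illustration; check the hub for the exact path.
captioner = pipeline("image-to-text", model="mo-thecreator/vit-gpt2-image-captioning")

result = captioner("photo.jpg")  # local path or URL of an image
print(result)  # e.g. [{"generated_text": "a dog sitting on a wooden bench"}]
```
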
## Idefics3 8B Llama3

Author: HuggingFaceM4 · License: Apache-2.0 · Downloads: 45.86k · Likes: 277
Tags: Image-to-Text · Transformers · English

Idefics3 is an open-source multimodal model that processes arbitrary interleaved sequences of image and text inputs and generates text outputs. It shows significant improvements in OCR, document understanding, and visual reasoning.

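Since Idefics3 takes interleaved image and text input, prompts are built with the processor's chat template rather than a plain string. A minimal sketch using the generic Vision2Seq API (requires a transformers release with Idefics3 support); the image URL is a placeholder:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "HuggingFaceM4/Idefics3-8B-Llama3"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

# The chat template inserts the image placeholder tokens for us.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(generated, skip_special_tokens=True)[0])
```
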
## Nebula

Author: SRDdev · License: MIT · Downloads: 17 · Likes: 0
Tags: Image-to-Text · Transformers

An image-to-text model focused on generating captions for images.

## Kosmos 2 Patch14 24 Dup Ms

Author: ishaangupta293 · License: MIT · Downloads: 21 · Likes: 0
Tags: Image-to-Text · Transformers

Kosmos-2 is a multimodal large language model that integrates visual information with language understanding for image-to-text generation and visual grounding tasks.

## Kosmos 2 Patch14 224

Author: microsoft · License: MIT · Downloads: 171.99k · Likes: 162
Tags: Image-to-Text · Transformers

Kosmos-2 is a multimodal large language model that understands and generates text descriptions of images and can link phrases in the generated text to specific image regions.

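Grounding is driven by the `<grounding>` prompt prefix, and the Kosmos-2 processor can split the raw generation into a clean caption plus entity/bounding-box pairs. A minimal sketch following the published Kosmos-2 usage pattern; the image URL is a placeholder:

```python
import requests
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "microsoft/kosmos-2-patch14-224"
model = AutoModelForVision2Seq.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/snowman.jpg", stream=True).raw)

# The <grounding> prefix asks the model to tie phrases to image regions.
prompt = "<grounding>An image of"
inputs = processor(text=prompt, images=image, return_tensors="pt")

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=64,
)
raw_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

# Separate the caption from the grounded entities (phrase, span, bounding boxes).
caption, entities = processor.post_process_generation(raw_text)
print(caption)
print(entities)
```
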
## Blip2 Test

Author: advaitadasein · License: MIT · Downloads: 18 · Likes: 0
Tags: Image-to-Text · Transformers · English

BLIP-2 is a vision-language model built on OPT-2.7b. It keeps both the image encoder and the large language model frozen and trains only a querying transformer (Q-Former) to bridge them, enabling image-to-text generation.

## Kosmos 2 Patch14 224

Author: ydshieh · Downloads: 62 · Likes: 54
Tags: Image-to-Text · Transformers

Kosmos-2 is a multimodal large language model that grounds language generation in real-world visual input, supporting a variety of vision-language tasks.

## Blip2 Flan T5 Xxl

Author: LanguageMachines · License: MIT · Downloads: 22 · Likes: 1
Tags: Image-to-Text · Transformers · English

BLIP-2 is a vision-language model that combines an image encoder with a large language model (here Flan-T5-XXL) for image-to-text tasks.

## Swin Aragpt2 Image Captioning V3

Author: AsmaMassad · Downloads: 18 · Likes: 0
Tags: Image-to-Text · Transformers

An image captioning model that pairs a Swin Transformer image encoder with an AraGPT2 text decoder, generating textual descriptions for input images.

## Blip2 Flan T5 Xl Sharded

Author: ethzanalytics · License: MIT · Downloads: 71 · Likes: 6
Tags: Image-to-Text · Transformers · English

A sharded build of the BLIP-2 Flan-T5-XL model for image-to-text tasks such as image captioning and visual question answering. Splitting the checkpoint into small shards lets it be loaded in low-memory environments.

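Sharding matters at load time: weights arrive shard by shard, so peak host memory stays near the size of one shard rather than the whole checkpoint. A minimal loading sketch; the repo id is inferred from the entry, and the flags shown are the standard Transformers low-memory options:

```python
import torch
from transformers import Blip2ForConditionalGeneration, Blip2Processor

# Assumed repo id for illustration; check the hub for the exact path.
model_id = "ethzanalytics/blip2-flan-t5-xl-sharded"

processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half-precision weights: roughly half the memory of float32
    low_cpu_mem_usage=True,     # stream shards instead of materializing a full copy first
    device_map="auto",          # place layers on GPU/CPU as capacity allows (needs accelerate)
)
```
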
## Blip2 Opt 6.7b

Author: Salesforce · License: MIT · Downloads: 5,871 · Likes: 76
Tags: Image-to-Text · Transformers · English

BLIP-2 is a vision-language model built on OPT-6.7b, pretrained with the image encoder and large language model frozen, supporting image-to-text generation and visual question answering.

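The same BLIP-2 checkpoint covers both tasks: with no text prompt it captions the image, and with a "Question: ... Answer:" prompt it answers questions about it. A minimal sketch following the published BLIP-2 usage pattern; the image URL is a placeholder:

```python
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

model_id = "Salesforce/blip2-opt-6.7b"
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open(requests.get("https://example.com/cats.jpg", stream=True).raw)

# Captioning: image only, no text prompt.
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())

# Visual question answering: use the question/answer prompt format.
prompt = "Question: how many cats are there? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())
```
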
## Blip2 Flan T5 Xl

Author: Salesforce · License: MIT · Downloads: 91.77k · Likes: 68
Tags: Image-to-Text · Transformers · English

BLIP-2 is a vision-language model built on Flan-T5-XL, pre-trained with the image encoder and large language model frozen, supporting image captioning and visual question answering.

## Textcaps Teste2

Author: artificialguybr · License: MIT · Downloads: 26 · Likes: 3
Tags: Image-to-Text · Transformers · Multilingual

GIT is a Transformer-based image-to-text generation model trained on large-scale image-text pairs, capable of image captioning and visual question answering.

## Git Large

Author: microsoft · License: MIT · Downloads: 1,404 · Likes: 15
Tags: Image-to-Text · Transformers · Multilingual

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens, designed for image-to-text generation tasks.

## Git Base

Author: microsoft · License: MIT · Downloads: 365.74k · Likes: 93
Tags: Image-to-Text · Transformers · Multilingual

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens, designed for image-to-text generation tasks.

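Because GIT is a plain causal decoder over image tokens, captioning needs only pixel values as input, and generation runs through the usual causal LM interface. A minimal sketch following the published GIT usage pattern; the image URL is a placeholder:

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/git-base"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/dog.jpg", stream=True).raw)

# Image-only input: the decoder generates the caption token by token.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```
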